NLP

Adversarial Active Learning for Sequence Labeling and Generation

This paper, published at IJCAI 2018, applies active learning to sequence problems. Most existing active learning methods rely on probability-based classifiers, which fit sequence tasks poorly (the space of label sequences is too large); the authors propose an adversarial-learning-based framework to address this.

paper

Introduction

Active Learning (from Wikipedia): Active learning is a special case of machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In statistics literature it is sometimes also called optimal experimental design.

In short, active learning addresses the scarcity of labeled samples in supervised learning. Most existing active learning methods are built on probabilistic classifiers: the uncertainty of an unlabeled sample is measured via the classifier's predicted probability distribution. If an unlabeled sample has high uncertainty, it carries information useful to the current classifier, so it is selected for annotation; this process is called query sample selection.
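For concreteness, here is a minimal pool-based active-learning loop for an ordinary (non-sequence) classifier using the least-confidence measure; the toy data, the scikit-learn model, and the budget of 20 queries per round are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                 # toy pool of 500 samples
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # oracle labels (queried on demand)

labeled = list(range(10))                     # small seed set
pool = [i for i in range(500) if i not in labeled]

for _ in range(5):                            # 5 query rounds
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)     # least-confidence score
    picks = np.argsort(uncertainty)[-20:]     # 20 most uncertain samples
    labeled += [pool[i] for i in picks]       # "annotate" the queried samples
    pool = [p for j, p in enumerate(pool) if j not in set(picks)]
```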

For sequence problems, however, this approach is computationally far too expensive:

Consider a label sequence with p tokens and each token can belong to k possible classes, then there are $k^{p}$ possible combinations of the label sequence. This complexity can grow exponentially with the length of the output.
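For example, even a modest tagging task with $k = 10$ candidate labels per token and sequences of $p = 20$ tokens already has $10^{20}$ possible label sequences to score.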

The adversarial active learning model for sequences (ALISE) proposed in this paper instead replaces that computation with adversarial learning:

The proposed adversarial active learning framework incorporates a neural network to explicitly assert each sample's informativeness with regard to labeled data.

Background: Active Learning for Sequences

Existing active learning methods for sequence problems measure uncertainty in the following ways:

  1. Least confidence (LC) score: $1 - P(y^{*} \mid x^{U})$, where $y^{*}$ is the most probable prediction (a full label sequence) for the unlabeled sample $x^{U}$, typically found with the Viterbi algorithm.

  2. Margin term: $-\left(P(y^{*}_{1} \mid x^{U}) - P(y^{*}_{2} \mid x^{U})\right)$, where $y^{*}_{1}$ and $y^{*}_{2}$ are the most and second-most probable label sequences.

  3. Sequence entropy: $-\sum_{y^{p}} P(y^{p} \mid x^{U}) \log P(y^{p} \mid x^{U})$, where $y^{p}$ ranges over all possible label sequences. (The cross entropy here is between the label-sequence distribution and itself, which is simply its own entropy, since $H(p,q) = H(p) + KL(p \| q)$ and $KL(p \| p) = 0$.)

    In practice, to reduce the computation, the sum is restricted to the N most probable label sequences (found e.g. via beam search), giving the N-best sequence entropy (NSE).

All three of the above are uncertainty measures: once every unlabeled sample has been scored, the samples with the highest uncertainty are annotated first (a minimal code sketch follows the quotation below).

The labeling priority should be given to samples with high entropy (corresponding to low confidence).
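As a concrete illustration, here is a minimal Python sketch of the three scores, assuming beam search has already returned the top-N label sequences with their probabilities in descending order; the renormalization in `nse_score` follows the usual N-best approximation.

```python
import math

def lc_score(nbest_probs):
    """Least confidence: 1 minus the probability of the best sequence."""
    return 1.0 - nbest_probs[0]

def margin_score(nbest_probs):
    """Negative gap between the top-1 and top-2 sequence probabilities."""
    return -(nbest_probs[0] - nbest_probs[1])

def nse_score(nbest_probs):
    """N-best sequence entropy over the renormalized top-N probabilities."""
    z = sum(nbest_probs)
    return -sum((p / z) * math.log(p / z) for p in nbest_probs)

# Hypothetical beam-search output: probabilities of the top-3 label
# sequences for one unlabeled sample, sorted in descending order.
probs = [0.5, 0.3, 0.1]
print(lc_score(probs), margin_score(probs), nse_score(probs))
```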

All of the above methods face the following problem:

When the number of candidate samples is large, calculating such complex uncertainty measures can take quite a long while, since every individual sample in the data pool must be scored.

Adversarial Active Learning for Sequences

The authors first define a matching score (Eq. 7 in the paper) between an unlabeled sample $x^{U}$ and the labeled training set $X^{L}$.

All unlabeled samples are ranked by this score, and the samples at the top of the ranking (the least similar ones) are selected for labeling:

A small similarity score implies that the unlabeled sample is not related to any labeled samples in the training set, and vice versa. The labeling priority is given to samples with low similarity scores.

To compute this matching score quantitatively, the authors propose the following architecture:

Figure 1: An overview of Adversarial Active Learning for sequences (ALISE). The black and blue arrows respectively indicate flows for labeled and unlabeled samples.

The encoder M (the two encoders in the figure are the same network with shared parameters) produces the latent representations; the discriminator D judges whether a latent representation from M comes from a labeled sample (1 = labeled, 0 = unlabeled).

As in a GAN, training alternates between two steps:

  1. Encoder & Decoder: on labeled data the decoder is trained with the usual supervised loss, while the adversarial term pushes the encoder M to fool D: "Mathematically, it encourages the discriminator D to output a score 1 for both $z^{L}$ and $z^{U}$."

  2. Discriminator: with M fixed, D is trained to tell the two sources apart, outputting 1 for the latent codes of labeled samples and 0 for those of unlabeled samples.

Therefore, the score from this discriminator already serves as an informativeness similarity score that could be directly used for Eq.7.
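The following is a minimal PyTorch sketch of the two alternating updates. The architectures (a toy LSTM encoder with mean pooling for M, a small MLP with sigmoid output for D), the hyperparameters, and the stubbed-out supervised decoding loss are all illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """M: maps a token sequence to a latent code z (toy stand-in)."""
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return h.mean(dim=1)  # pooled latent representation z

class Discriminator(nn.Module):
    """D: scores how 'labeled-like' a latent code looks (1 = labeled)."""
    def __init__(self, hid_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hid_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).squeeze(-1)

M, D = Encoder(), Discriminator()
bce = nn.BCELoss()
opt_m = torch.optim.Adam(M.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def train_step(x_l, x_u):
    # Step 1: update M (and, in the full model, the decoder). Besides the
    # supervised decoding loss on labeled data (stubbed out here), M tries
    # to make D output 1 for BOTH z_l and z_u.
    z_l, z_u = M(x_l), M(x_u)
    sup_loss = torch.tensor(0.0)  # placeholder for the decoder's supervised loss
    loss_m = sup_loss + bce(D(z_l), torch.ones(len(x_l))) \
                      + bce(D(z_u), torch.ones(len(x_u)))
    opt_m.zero_grad(); loss_m.backward(); opt_m.step()

    # Step 2: update D to separate the two sources: labeled -> 1, unlabeled -> 0.
    z_l, z_u = M(x_l).detach(), M(x_u).detach()
    loss_d = bce(D(z_l), torch.ones(len(x_l))) \
           + bce(D(z_u), torch.zeros(len(x_u)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Toy batch of 8 labeled and 8 unlabeled sequences of length 12.
x_l = torch.randint(0, 1000, (8, 12))
x_u = torch.randint(0, 1000, (8, 12))
train_step(x_l, x_u)
```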

After training, every unlabeled sample is passed through M and D to obtain its matching score:

Apparently, those samples with lowest scores should be sent out for labeling because they carry most valuable information in complementary to the current labeled data.
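Continuing the PyTorch sketch above, scoring the pool is then a single forward pass through M and D (the pool size and the selection budget are again toy assumptions):

```python
# Score the whole unlabeled pool with the trained M and D; the samples
# with the lowest scores are sent out for labeling.
with torch.no_grad():
    pool = torch.randint(0, 1000, (100, 12))  # 100 unlabeled sequences
    scores = D(M(pool))                       # per-sample scores in [0, 1]
    query_idx = scores.argsort()[:10]         # 10 least "labeled-like" samples
```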

Although ALISE does not rely on the decoder's label-sequence probabilities to measure uncertainty, the two can be combined. The authors fold the generation probability into the ALISE framework: the discriminator D first shortlists the top K unlabeled samples, and the decoder's generation probabilities are then used to pick, from those K, the k samples with the highest uncertainty for annotation (see the sketch below).
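Selection itself then reduces to simple ranking. Below is a sketch of both the plain ALISE rule and the two-stage ALISE+NSE variant; the function names and list-based scores are illustrative:

```python
def alise_select(d_scores, K):
    """Plain ALISE: return indices of the K unlabeled samples with the
    lowest discriminator scores (least similar to the labeled set)."""
    order = sorted(range(len(d_scores)), key=lambda i: d_scores[i])
    return order[:K]

def alise_nse_select(d_scores, nse_scores, K, k):
    """ALISE+NSE: shortlist K samples by lowest D score, then keep the
    k with the highest N-best sequence entropy (most uncertain)."""
    shortlist = alise_select(d_scores, K)
    return sorted(shortlist, key=lambda i: nse_scores[i], reverse=True)[:k]

# Example: D scores and NSE scores for five unlabeled samples.
d = [0.9, 0.2, 0.4, 0.1, 0.7]
h = [0.3, 1.2, 0.8, 0.5, 1.0]
print(alise_nse_select(d, h, K=3, k=2))  # -> [1, 2]
```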

ALISE does not generate any fake sample and just borrows the adversarial learning objective for sample scoring.

Experiments

Slot Filling

Figure 3: Image captioning results by active learning

Both the encoder and decoder are plain RNNs, and the discriminator is a fully connected network. There are 3,000 samples in total; each iteration selects 300 of them for annotation ("Random" denotes random selection) and trains on all data labeled so far. Once all 3,000 samples have been labeled, every method should in theory reach the same result.

Image Captioning

Figure 4: Image captioning results in the active learning setting by ALISE, ALISE+NSE and NSE-based approaches. The novel plausible descriptions are annotated with blue color while wrong descriptions are colored in red.

Computational Complexity:
Table 1: The active selection costs for different algorithms

Conclusion

This paper proposes an adversarial-learning framework for sequence-based active learning. It avoids the traditional scoring based on predicted probabilities, effectively improves efficiency, and can be plugged into many sequence models.